
Ct 854/change overzealous connection closing #428

Merged Feb 7, 2023 (18 commits)

Conversation

@VersusFacit (Contributor) commented Jan 31, 2023

resolves #201
closes #203

Description

Each time dbt talks to a Snowflake warehouse, connections are established and re-established repeatedly, and a lot of time is wasted on that step. Why not keep the already-open connection(s) alive for the duration of the execution? That's what this PR does.

Teasing out the existing behavior

Set threads to 1 in your profiles.yml, without the reuse_connections param, and watch entries arrive in:

select *
from table(information_schema.login_history_by_user())
order by event_timestamp desc;

New behavior

Just add the reuse_connections parameter to your Snowflake connection in profiles.yml, and a dbt run over Jaffle Shop goes from 7 unique login entries in Snowflake to 1. The time saved is upwards of 2/3. Pretty big deal!
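
A minimal profiles.yml sketch of what that looks like. The account, user, and credential values are placeholders, and the parameter name reuse_connections is taken from the discussion above:

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: my_account      # placeholder
      user: my_user            # placeholder
      password: my_password    # placeholder; use your usual auth method
      database: analytics
      warehouse: transforming
      schema: dbt_dev
      threads: 4
      reuse_connections: True  # the new flag this PR introduces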

Why this?

@joshuataylor (❤️) gave us a proof-of-concept fix in https://github.com/dbt-labs/dbt-snowflake/pull/203. I simplified the code down and added lines so users can plug and play with profiles.yml. I also added unit testing for this.

I was worried about thread safety, since self.lock is a bit of obtuse automagic. Eventually I realized that instead of reinventing the wheel, I could just super() up to SQLConnectionManager from dbt-core and its many safety checks. Otherwise, we're kind of at the mercy of largely undocumented Snowflake connector methods.
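
A minimal sketch of that shape, not the exact merged diff: the reuse_connections attribute on the credentials object is an assumption based on the profile parameter discussed above, and dbt-core's SQLConnectionManager supplies the locking and bookkeeping:

from dbt.adapters.sql import SQLConnectionManager


class SnowflakeConnectionManager(SQLConnectionManager):
    def release(self) -> None:
        # Sketch: when the profile opts into connection reuse, skip the
        # per-node close and let dbt-core's cleanup_all() tear everything
        # down at the end of the invocation. Otherwise defer to the base
        # class, which already guards its bookkeeping with self.lock.
        if getattr(self.profile.credentials, "reuse_connections", False):
            return
        super().release()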

A bug

Jeremy warns of this here. I ran into something like it once without looking for it: every model took under 2 seconds, but the last thread hung open for 8. Trials triggered this only with client_session_keep_alive set to True and reuse_connections set to False; no other combination reproduced it.

We could try to manually close connections, but setting an explicit timeout seems sketchy to me in the opposite direction.

I thought about closing any unused threads, but how would you do something like that in this Python runtime? At what point would it start sweeping over open threads and deciding which to close prematurely? If anyone has an idea, by all means, I'm super game; I'm just unsure how.

Things I'm uncomfy with

Technically, with this solution the connection is never closed; this shows up in the "bug" above. It's just cleaned up when the process exits. Is that acceptable in cases besides the bug above?

Update: We close all thread connections in the adapter context today, before tearing down the dbt runtime, so this has been addressed.
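
For context, a self-contained sketch of that adapter-context shape. The names here are illustrative, not dbt-core's exact API; the point is simply that every thread's connection is closed on the way out of the run, whatever happens inside it:

from contextlib import contextmanager


@contextmanager
def adapter_management(pool):
    # Illustrative: whatever happens during the run, close every
    # thread's connection on exit. dbt-core's cleanup_all() plays
    # this role for the real adapter.
    try:
        yield pool
    finally:
        for conn in pool:
            conn.close()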

Did you document this?

That I did. Corresponding docs PR up over yonder.


@github-actions commented

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-snowflake contributing guide.
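
For reference, dbt adapter repos manage changelog entries with changie; a hypothetical entry for this change might look like the following. The exact fields are defined by the repo's .changie.yaml, so treat this as a sketch, not the actual entry:

kind: Features
body: Add reuse_connections profile parameter to keep Snowflake connections
  open across nodes instead of reconnecting per node
time: 2023-01-31T00:00:00.000000-00:00
custom:
  Author: VersusFacit
  Issue: "201"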

@@ -151,6 +152,18 @@ def auth_args(self):
result["client_store_temporary_credential"] = True
# enable mfa token cache for linux
result["client_request_mfa_token"] = True

# Warning: This profile configuration can result in specific threads, even just one,
@VersusFacit (Contributor, Author) commented Jan 31, 2023

Context/Alternatives in PR description.

This is my proposed way of handling what appears to be a rare bug: warn people against it, don't encourage it.

@VersusFacit (Contributor, Author) commented Jan 31, 2023

Random thought: going to check whether sessions hang open in the console, if able, to get even more info on that "bug". Still, the questions above on how we want to proceed apply. (The console lists GUI sign-ins, not CI ones, so no help there.)

@nssalian self-requested a review Feb 1, 2023
@Fleid added labels pr_tracked and ready_for_review (Externally contributed PR has functional approval, ready for code review from Core engineering) Feb 1, 2023
@McKnight-42 (Contributor) left a comment

Oh, great pick-up! Wouldn't have wanted this one to disappear. All seems okay to me!!

@joshuataylor (Contributor) commented

🎉!!! This is great, what a fantastic solution.

Could the issue with connections hanging be related to latency? That's my gut feeling here: something funky with timeouts, or Snowflake having intermittent issues.

I'll see if I can run tracing with this and just keep repeating the runs until it happens. I'm ~250ms away, so we'll see how often it happens.

To test hanging connections, I'll try the following:

  1. Create a new user in Snowflake
  2. Set STATEMENT_TIMEOUT_IN_SECONDS for this user to a low value, like 15-30 seconds, so it doesn't hold everything else up (see the SQL sketch after this list).
  3. Set threads: 4 in profiles.yml; if that doesn't trigger it, try 2; if that doesn't work... 8.
  4. Have a few small test models that will keep being rebuilt
  5. dbt run 🏃

If this works, I'll try larger models to see whether those trigger it. I suspect there's a correlation with either shorter or longer runs. Or maybe it's just random 🤷.
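
For step 2 above, the timeout can be set with a statement along these lines (dbt_test_user is a placeholder name):

alter user dbt_test_user set statement_timeout_in_seconds = 30;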

@mikealfare (Contributor) left a comment

I wonder if this is related to the issue that we're seeing in local unit tests, and in the scenario where we don't specify the number of workers for unit tests.

@VersusFacit (Contributor, Author) commented Feb 6, 2023

@joshuataylor After digging into this more, my only real concern was this behavior resulting in financial cost to users without their knowing -- and that appears NOT to be the case, per both the docs and anecdotes from users.

I believe the bug I ran into -- where execution on one thread hung open for an unusual amount of time -- was related to intermittent Snowflake bottlenecks rather than anything caused by leaving open threads in the pool.

@VersusFacit (Contributor, Author) commented Feb 7, 2023

Revised the code to meet some design specs, and confirmed the following after talking to Snowflake folks:

  1. Threads must be cleaned up before Python teardown, because the modules the connector package depends on are torn down in nondeterministic order (huh!).
  2. Buuuut an atexit handler for connections isn't needed:
  • I ran two dozen trials over a dozen-plus models with all manner of thread and flag combinations -- never once did the connections hang; the cleanup_all function we indirectly call through core does the heavy lifting here.
  • Also, even if this cleanup weren't present, leaving threads open would only be problematic if one were using undocumented behaviors (thanks, Snowflake team, for that insight!).
  3. client_session_keep_alive has no visible conflicts! So I removed the warning code.

This one was an adventure; thank you, everyone, for your patience and insight in getting us to these answers!

@nssalian (Contributor) left a comment

lgtm. Thanks for digging in and investigating this to understand + confirm the underlying behavior.

@McKnight-42 (Contributor) left a comment

LGTM, great catch on the flip!

Labels: cla:yes, ready_for_review (Externally contributed PR has functional approval, ready for code review from Core engineering)
Development

Successfully merging this pull request may close: [CT-854] Cache authentication (token) for subsequent connections

6 participants